Video Object Segmentation using Supervoxel-Based Gerrymandering
Pixels operate locally. Superpixels have some potential to collect
information across many pixels; supervoxels have more potential by implicitly
operating across time. In this paper, we explore this well-established notion,
thoroughly analyzing how supervoxels can be used in place of and in conjunction
with other means of aggregating information across space-time. Focusing on the
problem of strictly unsupervised video object segmentation, we devise a method
called supervoxel gerrymandering that links masks of foregroundness and
backgroundness via local and non-local consensus measures. We pose and answer a
series of critical questions about the ability of supervoxels to adequately
sway local voting; the questions regard type and scale of supervoxels as well
as local versus non-local consensus, and the questions are posed in a general
way so as to impact the broader knowledge of the use of supervoxels in video
understanding. We work with the DAVIS dataset and find that our analysis yields
an unsupervised method that outperforms all other known unsupervised methods
and even many supervised ones.
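As a rough illustration of the consensus idea in this abstract, the sketch below fuses a per-pixel foregroundness map with a supervoxel-level vote and a non-local cue; the function name, weights, and threshold are hypothetical and are not the authors' implementation.

```python
import numpy as np

def combine_consensus(local_fg, supervoxel_ids, nonlocal_fg,
                      w_local=0.6, w_nonlocal=0.4, thresh=0.5):
    """Hypothetical sketch: fuse per-pixel foregroundness with a
    supervoxel-level consensus and a non-local consensus map.

    local_fg       : (H, W) float array, per-pixel foregroundness in [0, 1]
    supervoxel_ids : (H, W) int array, supervoxel label of each pixel
    nonlocal_fg    : (H, W) float array, non-local (e.g., appearance-based) cue
    """
    fused = np.zeros_like(local_fg)
    for sv in np.unique(supervoxel_ids):
        mask = supervoxel_ids == sv
        # every pixel in the supervoxel inherits the supervoxel's mean vote
        sv_vote = local_fg[mask].mean()
        fused[mask] = w_local * sv_vote + w_nonlocal * nonlocal_fg[mask]
    return fused > thresh
```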
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames
Semi-supervised video object segmentation has made significant progress on
real and challenging videos in recent years. The current paradigm for
segmentation methods and benchmark datasets is to segment objects in video
provided a single annotation in the first frame. However, we find that
segmentation performance across the entire video varies dramatically when
selecting an alternative frame for annotation. This paper addresses the problem
of learning to suggest the single best frame across the video for user
annotation, which is, in fact, never the first frame of the video. We achieve this by
introducing BubbleNets, a novel deep sorting network that learns to select
frames using a performance-based loss function that enables the conversion of
expansive amounts of training examples from already existing datasets. Using
BubbleNets, we are able to achieve an 11% relative improvement in segmentation
performance on the DAVIS benchmark without any changes to the underlying method
of segmentation.
Comment: CVPR 2019
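The "deep sorting" idea can be illustrated with a bubble-sort pass driven by a learned pairwise comparator; this is a minimal sketch in which compare_fn stands in for the trained comparison network, and it is not the paper's implementation.

```python
def select_guidance_frame(frame_features, compare_fn):
    """Hypothetical sketch of deep sorting: repeatedly compare frames with a
    learned predictor and bubble the frame expected to give the best
    segmentation performance to the front of the ordering.

    frame_features : list of per-frame feature vectors
    compare_fn     : callable(f_a, f_b) -> float, positive if frame a is
                     predicted to be the better annotation frame (a stand-in
                     for the trained comparison network)
    """
    order = list(range(len(frame_features)))
    for _ in range(len(order) - 1):              # bubble-sort style passes
        for i in range(len(order) - 1):
            a, b = order[i], order[i + 1]
            if compare_fn(frame_features[a], frame_features[b]) < 0:
                order[i], order[i + 1] = b, a    # b predicted better; move it forward
    return order[0]                              # index of the suggested annotation frame
```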
Tukey-Inspired Video Object Segmentation
We investigate the problem of strictly unsupervised video object
segmentation, i.e., the separation of a primary object from background in video
without a user-provided object mask or any training on an annotated dataset. We
find foreground objects in low-level vision data using a John Tukey-inspired
measure of "outlierness". This Tukey-inspired measure also estimates the
reliability of each data source as video characteristics change (e.g., a camera
starts moving). The proposed method achieves state-of-the-art results for
strictly unsupervised video object segmentation on the challenging DAVIS
dataset. Finally, we use a variant of the Tukey-inspired measure to combine the
output of multiple segmentation methods, including those using supervision
during training, runtime, or both. This collectively more robust method of
segmentation improves the Jaccard measure of its constituent methods by as much
as 28%.
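A minimal sketch of a Tukey-style outlierness score, based on the interquartile range, gives a feel for the kind of measure described above; the exact formulation here (quartile fences scaled by the IQR) is an assumption, not the paper's definition.

```python
import numpy as np

def tukey_outlierness(values):
    """Hypothetical sketch of a Tukey-style outlierness score: how far each
    value falls outside the interquartile range of its data source. Values
    well beyond the quartiles (candidate foreground) get large scores.

    values : 1D array of a low-level cue (e.g., per-pixel optical-flow magnitude)
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = max(q3 - q1, 1e-6)                       # guard against a degenerate spread
    upper = np.clip((values - q3) / iqr, 0, None)  # distance above the upper quartile
    lower = np.clip((q1 - values) / iqr, 0, None)  # distance below the lower quartile
    return np.maximum(upper, lower)                # 0 inside the IQR, grows outside
```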
Kinematically-Informed Interactive Perception: Robot-Generated 3D Models for Classification
To be useful in everyday environments, robots must be able to observe and
learn about objects. Recent datasets enable progress for classifying data into
known object categories; however, it is unclear how to collect reliable object
data when operating in cluttered, partially-observable environments. In this
paper, we address the problem of building complete 3D models for real-world
objects using a robot platform, which can remove objects from clutter for
better classification. Furthermore, we are able to learn entirely new object
categories as they are encountered, enabling the robot to classify previously
unidentifiable objects during future interactions. We build models of grasped
objects using simultaneous manipulation and observation, and we guide the
processing of visual data using a kinematic description of the robot to combine
observations from different viewpoints and remove background noise. To test
our framework, we use a mobile manipulation robot equipped with an RGBD camera
to build voxelized representations of unknown objects and then classify them
into new categories. We then have the robot remove objects from clutter to
manipulate, observe, and classify them in real time.
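The viewpoint-fusion step can be sketched as follows: use the kinematic (end-effector) pose of each observation to express points in a common grasped-object frame, then accumulate them into a voxel grid. Function names, frames, and parameters below are hypothetical, not the paper's implementation.

```python
import numpy as np

def accumulate_voxels(point_clouds, ee_poses, voxel_size=0.005, grid_dim=64):
    """Hypothetical sketch: fuse object points observed from different
    viewpoints into one voxel grid, using end-effector poses from the robot's
    kinematics to express every observation in a common grasped-object frame.

    point_clouds : list of (N_i, 3) arrays in the camera frame
    ee_poses     : list of 4x4 homogeneous transforms, the camera/end-effector
                   pose from forward kinematics for each observation
    """
    grid = np.zeros((grid_dim, grid_dim, grid_dim), dtype=bool)
    origin = -0.5 * grid_dim * voxel_size              # center the grid on the grasp frame
    for pts, pose in zip(point_clouds, ee_poses):
        homog = np.hstack([pts, np.ones((len(pts), 1))])
        in_grasp_frame = (np.linalg.inv(pose) @ homog.T).T[:, :3]
        idx = np.floor((in_grasp_frame - origin) / voxel_size).astype(int)
        keep = np.all((idx >= 0) & (idx < grid_dim), axis=1)  # drop points outside the grid
        grid[tuple(idx[keep].T)] = True
    return grid
```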
Robot-Supervised Learning for Object Segmentation
To be effective in unstructured and changing environments, robots must learn
to recognize new objects. Deep learning has enabled rapid progress for object
detection and segmentation in computer vision; however, this progress comes at
the price of human annotators labeling many training examples. This paper
addresses the problem of extending learning-based segmentation methods to
robotics applications where annotated training data is not available. Our
method enables pixelwise segmentation of grasped objects. We factor the problem
of segmenting the object from the background into two sub-problems: (1)
segmenting the robot manipulator and object from the background and (2)
segmenting the object from the manipulator. We propose a kinematics-based
foreground segmentation technique to solve (1). To solve (2), we train a
self-recognition network that segments the robot manipulator. We train this
network without human supervision, leveraging our foreground segmentation
technique from (1) to label a training set of images containing the robot
manipulator without a grasped object. We demonstrate experimentally that our
method outperforms state-of-the-art adaptable in-hand object segmentation. We
also show that a training set composed of automatically labeled images of
grasped objects improves segmentation performance on a test set of images of
the same objects in the environment.
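The two-sub-problem factorization lends itself to a very small sketch: the grasped-object mask is the kinematics-based foreground minus the manipulator region predicted by the self-recognition network. The function below is illustrative only, not the paper's code.

```python
import numpy as np

def grasped_object_mask(foreground_mask, manipulator_mask):
    """Hypothetical sketch of the two-stage factorization: the grasped object
    is whatever the kinematics-based foreground step keeps and the
    self-recognition network does not attribute to the manipulator.

    foreground_mask  : (H, W) bool array from kinematics-based foreground segmentation
    manipulator_mask : (H, W) bool array from the self-recognition network
    """
    return foreground_mask & ~manipulator_mask


# Usage idea: label training images without human supervision by running the
# kinematics-based foreground step on images where the robot's state says the
# gripper is empty, so the foreground is the manipulator alone.
```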
Predicting Future Lane Changes of Other Highway Vehicles using RNN-based Deep Models
In the event of sensor failure, autonomous vehicles need to safely execute
emergency maneuvers while avoiding other vehicles on the road. To accomplish
this, the sensor-failed vehicle must predict the future semantic behaviors of
other drivers, such as lane changes, as well as their future trajectories given
a recent window of past sensor observations. We address the first issue of
semantic behavior prediction in this paper, which is a precursor to trajectory
prediction, by introducing a framework that leverages the power of recurrent
neural networks (RNNs) and graphical models. Our goal is to predict the future
categorical driving intent, for lane changes, of neighboring vehicles up to
three seconds into the future given as little as a one-second window of past
LIDAR, GPS, inertial, and map data.
We collect real-world data containing over 20 hours of highway driving using
an autonomous Toyota vehicle. We propose a composite RNN model by adopting the
methodology of Structural Recurrent Neural Networks (SRNNs) to learn factor
functions and take advantage of both the high-level structure of graphical
models and the sequence modeling power of RNNs, which we expect to afford more
transparent modeling of activity than opaque, single-RNN models. To
demonstrate our approach, we validate our model using authentic interstate
highway driving to predict the future lane change maneuvers of other vehicles
neighboring our autonomous vehicle. We find that our composite Structural RNN
outperforms baselines by as much as 12% in balanced accuracy metrics.
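A rough sense of the composite Structural-RNN idea can be given with a short sketch: small "edge" RNNs model each neighbor-vehicle interaction, and a "node" RNN fuses them with the predicted vehicle's own state. The class name, feature dimensions, GRU cells, and mean-pooling over neighbors are my assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StructuralLaneChangeRNN(nn.Module):
    """Hypothetical sketch of a composite structural RNN for categorical
    driving intent (left lane change, right lane change, or keep lane)."""

    def __init__(self, ego_dim=8, neighbor_dim=8, hidden=64, num_classes=3):
        super().__init__()
        self.edge_rnn = nn.GRU(neighbor_dim, hidden, batch_first=True)   # per-neighbor factor
        self.node_rnn = nn.GRU(ego_dim + hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, ego_seq, neighbor_seqs):
        # ego_seq:       (B, T, ego_dim) past observations of the predicted vehicle
        # neighbor_seqs: (B, N, T, neighbor_dim) past observations of N neighbors
        B, N, T, D = neighbor_seqs.shape
        edge_out, _ = self.edge_rnn(neighbor_seqs.reshape(B * N, T, D))
        edge_feat = edge_out.reshape(B, N, T, -1).mean(dim=1)   # pool neighbor factors
        node_out, _ = self.node_rnn(torch.cat([ego_seq, edge_feat], dim=-1))
        return self.classifier(node_out[:, -1])                 # intent logits at the last step
```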
A Critical Investigation of Deep Reinforcement Learning for Navigation
The navigation problem is classically approached in two steps: an exploration
step, where map-information about the environment is gathered; and an
exploitation step, where this information is used to navigate efficiently. Deep
reinforcement learning (DRL) algorithms, alternatively, approach the problem of
navigation in an end-to-end fashion. Inspired by the classical approach, we ask
whether DRL algorithms are able to inherently explore, gather and exploit
map-information over the course of navigation. We build upon the work of
Mirowski et al. [2017] and introduce a systematic suite of experiments that vary three
parameters: the agent's starting location, the agent's target location, and the
maze structure. We choose evaluation metrics that explicitly measure the
algorithm's ability to gather and exploit map-information. Our experiments show
that when trained and tested on the same maps, the algorithm successfully
gathers and exploits map-information. However, when trained and tested on
different sets of maps, the algorithm fails to transfer the ability to gather
and exploit map-information to unseen maps. Furthermore, we find that when the
goal location is randomized and the map is kept static, the algorithm is able
to gather and exploit map-information but the exploitation is far from optimal.
We open-source our experimental suite in the hopes that it serves as a
framework for the comparison of future algorithms and leads to the discovery of
robust alternatives to classical navigation methods.
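The experimental protocol above amounts to a small configuration grid plus a metric for map exploitation. The sketch below is purely illustrative; the regime names and the path-efficiency proxy are assumptions, not the paper's exact settings or metrics.

```python
from itertools import product

# Hypothetical experiment grid: every combination of maze set, start regime,
# and goal regime, evaluated with a proxy for exploiting map-information.
MAZES = ["train_maps", "unseen_maps"]
START = ["fixed_start", "random_start"]
GOAL = ["fixed_goal", "random_goal"]

def path_efficiency(agent_path_len, shortest_path_len):
    """1.0 means the agent matched the shortest path (good map exploitation);
    lower values mean wandering. Illustrative, not the paper's metric."""
    return shortest_path_len / max(agent_path_len, 1e-6)

experiments = [
    {"maze": m, "start": s, "goal": g} for m, s, g in product(MAZES, START, GOAL)
]
```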
Learning Object Depth from Camera Motion and Video Object Segmentation
Video object segmentation, i.e., the separation of a target object from
background in video, has made significant progress on real and challenging
videos in recent years. To leverage this progress in 3D applications, this
paper addresses the problem of learning to estimate the depth of segmented
objects given some measurement of camera motion (e.g., from robot kinematics or
vehicle odometry). We achieve this by, first, introducing a diverse, extensible
dataset and, second, designing a novel deep network that estimates the depth of
objects using only segmentation masks and uncalibrated camera movement. Our
data-generation framework creates artificial object segmentations that are
scaled for changes in distance between the camera and object, and our network
learns to estimate object depth even with segmentation errors. We demonstrate
our approach across domains using a robot camera to locate objects from the YCB
dataset and a vehicle camera to locate obstacles while driving.
Comment: ECCV 2020
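The geometry this task rests on can be written down directly. The sketch below is a worked example of the idealized pinhole relation between mask scale change and depth under a known forward camera motion; it is not the paper's learned network, which also tolerates segmentation errors, and the function name and units are assumptions.

```python
def depth_from_scale_change(mask_width_before, mask_width_after, camera_advance):
    """Idealized pinhole relation: apparent size is inversely proportional to
    depth, so a known camera motion toward the object plus the change in the
    segmentation mask's scale determines the object's depth.

    mask_width_before : object mask width (pixels) before the camera moves
    mask_width_after  : object mask width (pixels) after moving camera_advance
                        meters straight toward the object
    Returns the object's depth (meters) at the initial camera position.
    """
    scale = mask_width_after / mask_width_before   # > 1 when approaching
    return camera_advance * scale / (scale - 1.0)


# Example: the mask grows from 100 px to 125 px after advancing 0.5 m, so
# scale = 1.25 and the object started 0.5 * 1.25 / 0.25 = 2.5 m away.
```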
Depth from Camera Motion and Object Detection
This paper addresses the problem of learning to estimate the depth of
detected objects given some measurement of camera motion (e.g., from robot
kinematics or vehicle odometry). We achieve this by 1) designing a recurrent
neural network (DBox) that estimates the depth of objects using a generalized
representation of bounding boxes and uncalibrated camera movement and 2)
introducing the Object Depth via Motion and Detection Dataset (ODMD). ODMD
training data are extensible and configurable, and the ODMD benchmark includes
21,600 examples across four validation and test sets. These sets include mobile
robot experiments using an end-effector camera to locate objects from the YCB
dataset and examples with perturbations added to camera motion or bounding box
data. In addition to the ODMD benchmark, we evaluate DBox in other monocular
application domains, achieving state-of-the-art results on existing driving and
robotics benchmarks and estimating the depth of objects using a camera phone.
Comment: CVPR 2021
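For intuition, the same pinhole relationship extends to a sequence of bounding-box observations, where depth can be recovered in closed form by least squares. The sketch below is an illustrative baseline under ideal assumptions, not the DBox network itself; the function name and input conventions are mine.

```python
import numpy as np

def depth_from_box_sequence(box_widths, camera_advances):
    """Least-squares depth from boxes under ideal pinhole assumptions:
    1/width is linear in the camera's advance toward the object, so several
    (width, advance) observations determine the object's initial depth.

    box_widths      : array of detected bounding-box widths (pixels)
    camera_advances : array of camera displacements toward the object (meters),
                      measured from the first observation (first entry is 0)
    """
    w = np.asarray(box_widths, dtype=float)
    a = np.asarray(camera_advances, dtype=float)
    # 1/w_i = d0/k - a_i/k  =>  solve for [d0/k, 1/k] in the least-squares sense
    A = np.stack([np.ones_like(a), -a], axis=1)
    x, *_ = np.linalg.lstsq(A, 1.0 / w, rcond=None)
    return x[0] / x[1]                      # initial object depth d0 in meters
```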
Learning Kinematic Descriptions using SPARE: Simulated and Physical ARticulated Extendable dataset
Next generation robots will need to understand intricate and articulated
objects as they cooperate in human environments. To do so, these robots will
need to move beyond their current abilities (working with relatively simple
objects in a task-indifferent manner) toward more sophisticated abilities
that dynamically estimate the properties of complex, articulated objects. To
that end, we make two compelling contributions toward general articulated
(physical) object understanding in this paper. First, we introduce a new
dataset, SPARE: Simulated and Physical ARticulated Extendable dataset. SPARE is
an extendable open-source dataset providing equivalent simulated and physical
instances of articulated objects (kinematic chains), giving the greater
research community a training and evaluation tool for methods that generate
kinematic descriptions of articulated objects. To the best of our knowledge,
this is the first joint visual and physical (3D-printable) dataset for the
Vision community. Second, we present a deep neural network that can predict the
number of links and the length of the links of an articulated object. These new
ideas outperform classical approaches to understanding kinematic chains, such as
tracking-based methods, which fail in the case of occlusion and do not leverage
multiple views when available.
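The simulated half of such a dataset can be illustrated with planar forward kinematics: given link lengths and joint angles, generate the joint positions of a kinematic chain. This is a minimal, hypothetical sketch of the kind of data and prediction targets involved, not the SPARE generation pipeline.

```python
import numpy as np

def chain_joint_positions(link_lengths, joint_angles):
    """Planar forward kinematics for a kinematic chain, producing the 2D joint
    positions against which predictions (number of links, link lengths) could
    be checked.

    link_lengths : sequence of link lengths (one per link)
    joint_angles : sequence of joint angles in radians (relative to the parent link)
    """
    positions = [np.zeros(2)]          # chain base at the origin
    heading = 0.0
    for length, angle in zip(link_lengths, joint_angles):
        heading += angle               # each joint rotates relative to its parent
        step = length * np.array([np.cos(heading), np.sin(heading)])
        positions.append(positions[-1] + step)
    return np.stack(positions)         # (num_links + 1, 2) joint coordinates


# Example targets for a 3-link chain: num_links = 3, link_lengths = [0.1, 0.1, 0.05].
```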